Support vector machines for spam categorization
Authors
Abstract
We study the use of support vector machines (SVMs) in classifying e-mail as spam or nonspam, comparing them with three other classification algorithms: Ripper, Rocchio, and boosting decision trees. These four algorithms were tested on two different data sets: one data set where the number of features was constrained to the 1000 best features and another data set where the dimensionality was over 7000. SVMs performed best when using binary features. For both data sets, boosting trees and SVMs had acceptable test performance in terms of accuracy and speed. However, SVMs required significantly less training time.
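The abstract reports that SVMs did best with binary features, that is, features recording only whether a term occurs in a message. The following is a minimal sketch of that idea, not the authors' implementation: the abstract does not say how the SVMs were trained, so this example assumes a linear SVM without a bias term, fitted by batch subgradient descent on the L2-regularized hinge loss. The vocabulary, the toy messages, and the names `binary_features`, `train_linear_svm`, and `predict` are all hypothetical.

```python
def binary_features(text, vocab):
    """Return a 0/1 vector: 1.0 if the vocabulary word occurs in the message."""
    words = set(text.lower().split())
    return [1.0 if term in words else 0.0 for term in vocab]


def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))


def train_linear_svm(X, y, lam=0.01, eta=0.1, epochs=100):
    """Batch subgradient descent on lam/2 * ||w||^2 + mean hinge loss.

    Assumption for this sketch: linear kernel, no bias term.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        grad = [lam * wj for wj in w]          # gradient of the regularizer
        for xi, yi in zip(X, y):
            if yi * dot(w, xi) < 1.0:          # point violates the margin
                for j in range(d):
                    grad[j] -= yi * xi[j] / n  # hinge-loss subgradient
        w = [wj - eta * gj for wj, gj in zip(w, grad)]
    return w


def predict(w, x):
    """+1 means spam, -1 means nonspam; a score of exactly 0 falls to nonspam."""
    return 1 if dot(w, x) > 0 else -1


# Hypothetical toy data, not taken from the paper's corpora.
vocab = ["free", "cash", "meeting", "report"]
messages = [
    ("free cash now", 1),
    ("win a free prize", 1),
    ("cash bonus inside", 1),
    ("meeting at noon", -1),
    ("quarterly report attached", -1),
    ("report for the meeting", -1),
]
X = [binary_features(text, vocab) for text, _ in messages]
y = [label for _, label in messages]
w = train_linear_svm(X, y)

new_message = binary_features("Free cash for you", vocab)
print(predict(w, new_message))
```

With binary features, repeating a word in a message does not change its vector, so the classifier depends only on which terms are present. A practical system would typically use a tested SVM library and a much larger, selected vocabulary rather than this hand-written loop.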
Similar resources
Facebook Page Spam detection using Support Vector Machines based on n-gram model
With social networks like Facebook and Twitter reaching the masses, they have become prime targets for spammers. The newest way to mislead and defraud viewers is Page Spam. Viewers are deceived into clicking on links that spam their connections, redirect to a fraudulent business, or spread wrong information about famous figures, organizations and causes. This research aims to categorize such p...
Performance Analysis of Naïve Bayes Classification, Support Vector Machines and Neural Networks for Spam Categorization
Spam mail recognition is a new growing field which brings together the topic of natural language processing and machine learning as it is in essence a two class classification of natural language texts. An important feature of spam recognition is that it is a cost-sensitive classification: misclassification of a non-spam mail as spam is generally a more severe error than misclassifying a spam m...
A new feature selection algorithm based on binomial hypothesis testing for spam filtering
Content-based spam filtering is a binary text categorization problem. To improve the performance of the spam filtering, feature selection, as an important and indispensable means of text categorization, also plays an important role in spam filtering. We proposed a new method, named Bi-Test, which utilizes binomial hypothesis testing to estimate whether the probability of a feature belonging to ...
A new term-weighting scheme for naïve Bayes text categorization
Purpose – Automatic text categorization has applications in several domains, for example e-mail spam detection, sexual content filtering, directory maintenance, and focused crawling, among others. Most information retrieval systems contain several components which use text categorization methods. One of the first text categorization methods was designed using a naïve Bayes representation of the...
A Comparative Performance Study of Feature Selection Methods for the Anti-spam Filtering Domain
In this paper we analyse the strengths and weaknesses of the most commonly used feature selection methods in text categorization when they are applied to the spam problem domain. Several experiments with different feature selection methods and content-based filtering techniques are carried out and discussed. Information Gain, χ² test, Mutual Information and Document Frequency feature selection methods ...
Journal title:
- IEEE Transactions on Neural Networks
Volume 10, Issue 5
Pages: -
Publication date: 1999